Network topology plays a vital role in understanding the
performance of network applications and protocols. Thus, 
recently there has been tremendous interest in generating 
realistic network topologies. Such work must begin with 
an understanding of existing network topologies, which to- 
day typically consists of a relatively small number of data 
sources. In this paper, we calculate an extensive set of im- 
portant characteristics of Internet AS-level topologies ex- 
tracted from the three data sources most frequently used by 
the research community: traceroutes, BGP, and WHOIS. 
We find that traceroute and BGP topologies are similar to 
one another but differ substantially from the WHOIS topol- 
ogy. We discuss the interplay between the properties of the 
data sources that result from specific data collection mech- 
anisms and the resulting topology views. We find that, 
among metrics widely considered, the joint degree distri- 
bution appears to fundamentally characterize Internet AS- 
topologies: it narrowly defines values for other important 
metrics. We also introduce evaluation criteria for the accuracy
of topology generators and verify previous observations
that generators solely reproducing degree distributions
cannot capture the full spectrum of critical topological char- 
acteristics of any of the three topologies. Finally, we release 
to the community the input topology datasets, along with 
the scripts and output of our calculations. This supplement 
should enable researchers to validate their models against 
real data and to make more informed selection of topology 
data sources for their specific needs. 

1 Introduction 

Internet topology analysis and modeling has attracted sub- 
stantial attention recently QEIEllIIEllllllTllH). 1 Such an in- 
terest is not surprising since the Internet's topological prop- 
erties and their evolution are cornerstones of many practical 



1 We intentionally avoid citing statistical physics literature, where the
number of publications dedicated to the subject has exploded. For an
introduction and references see [9, 10].



and theoretical network research agendas. Our own motivation
for this study is the need to construct accurate network
emulation environments [11] that will enable development,
reliable testing, and performance evaluation of new applications,
protocols, and routing architectures [10]. Knowledge
of realistic network topologies and the availability of tools 
to generate them are essential to this goal. We also seek 
to develop a methodology to compare topologies to one an- 
other based on relatively simple metrics. That is, we seek 
a set of metrics such that when two topologies demonstrate 
similar values for a particular property, they will be similar 
across a broad range of potential properties. 

There are a number of sources of Internet topology data, 
obtained using different methodologies that yield substan- 
tially different topological views of the Internet. Unfortu- 
nately, many researchers either rely only on one data source, 
sometimes outdated or incomplete, or mix disparate data 
sources into one topology. To date, there has been little 
attempt to provide a detailed analytical comparison of the 
most important properties of topologies extracted from the 
different data sources. 

Our study fills this gap by analyzing and explaining
topological properties of Internet AS-level graphs extracted
from the three commonly used data sources: (1) traceroute
measurements [12]; (2) BGP [13]; and (3) the WHOIS
database [14]. This work makes three key contributions to
the field of topology research: 

1. We calculate a broad range of topology metrics considered
in the networking literature for the three sources
of data. We reveal the peculiarities of each data source 
and the resulting interplay between artifacts of data 
collection and the key properties of the derived graphs. 

2. We highlight the interdependencies between a broad 
array of topological features and discuss their rele- 
vance when comparing Internet topologies to various 
random graph models that attempt to capture Inter- 
net topology characteristics. Our analysis shows that 
graph models that reproduce the joint degree distribu- 
tion of the graphs also capture other crucial topological 






characteristics to best approximate the topology. 

3. To promote and simplify further analysis and discussion,
we release [15] the following data and results
to the community: a) the AS-graphs representing the 
topologies extracted from the raw data sources; b) the 
full set of data plots (many not included in the paper) 
calculated for all graphs; c) the data files associated 
with the plots, useful for researchers looking for other 
summary statistics or for direct comparisons with em- 
pirical data; and d) the scripts and programs we devel- 
oped for our calculations. 

We organize this paper as follows. Section 2 describes
our data sources and how we constructed AS-level graphs
from these data. In Section 3 we present the set of topological
characteristics calculated from our graphs and explain
what they measure and why they are important. Section 4
compares properties of the observed topologies with
classes of random graphs and discusses the accuracy criteria
for topology generators. We discuss the limitations of
our study in Section 5. We conclude in Section 6 with a
summary of our findings.

2 Construction of AS-level graphs 
2.1 Data sources 

We used the following data sources to construct AS-level 
graphs of the Internet: traceroute measurements, BGP data, 
and the WHOIS database. 

Traceroute [16] is a tool that captures a sequence of IP
hops along the forward path from the source to a given destination
by sending either UDP or ICMP probe packets to
the destination.

CAIDA has developed a tool, skitter [12], to collect continuous
traceroute-based Internet topology measurements.
AS-level topology graphs derived from the skitter data
on a daily basis are available for download at [17]. For
this study, we used the 31 daily graphs for the month of
March 2004. The measurements contain multi-origin ASes
(prefixes announced by different originating ASes) [18],
AS-sets [19], and private ASes [20]. Both multi-origin
ASes and AS-sets create ambiguous mappings between IP
addresses and ASes, while private ASes create false links.
Hence we filter AS-sets, multi-origin ASes, and private
ASes from each graph, and we discard indirect links [17].
We then merge the daily graphs to form one graph, referred
to as the skitter graph throughout the rest of the paper.

BGP (Border Gateway Protocol) [19] is the protocol used
for routing among ASes in the Internet. Route Views [13]
collects and archives both static snapshots of the BGP routing
tables and dynamic BGP data in the form of BGP message
dumps (updates and withdrawals). Therefore, we derive
two types of graphs from the BGP data for the same
month of March 2004: one from the static tables (BGP tables)
and one from the updates (BGP updates). In both
cases, we filter AS-sets and private ASes and merge the 31
daily graphs into one.

WHOIS [14] is a collection of databases containing a
wide range of information useful to network operators. Unfortunately,
these databases are manually maintained with
few requirements for updating the registered information
in a timely fashion. RIPE's [21, 22] WHOIS database contains
the most reliable current topological information, although
it covers primarily European Internet infrastructure.

We obtained the RIPE WHOIS database dump for 
April 07, 2004. The records of interest to us are: 

aut-num: ASx 
import: from ASy 
export: to ASz

which indicate links ASx-ASy and ASx-ASz. We con- 
struct an AS-level graph (referred to as WHOIS graph) from 
these data and exclude ASes that did not appear in the 
aut-num lines. Such ASes are external to the database 
and we cannot correctly estimate their topological proper- 
ties (e.g. node degree). We also filter private ASes. 
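
The construction above can be sketched in Python. This is only an illustration, not the released tooling: the function name `whois_graph`, the one-attribute-per-line input, and the regular expressions are assumptions made for this example, and real RPSL import and export attributes carry richer policy expressions than the simplified records shown. The private AS range 64512-65534 is the standard 16-bit private-use block and is likewise an assumption about the filter applied.

```python
import re

# Assumed filter: the 16-bit private-use AS number block.
PRIVATE_ASES = range(64512, 65535)

AUT_NUM = re.compile(r"aut-num\s*:\s*AS(\d+)")
POLICY = re.compile(r"(?:import\s*:\s*from|export\s*:\s*to)\s+AS(\d+)")

def whois_graph(lines):
    """Build an undirected AS graph from simplified aut-num records."""
    sources = set()   # ASes that own an aut-num record
    links = set()     # candidate undirected links, stored as frozensets
    current = None    # AS whose record is being read
    for line in lines:
        line = line.strip()
        m = AUT_NUM.match(line)
        if m:
            current = int(m.group(1))
            sources.add(current)
            continue
        m = POLICY.match(line)
        if m and current is not None:
            peer = int(m.group(1))
            if peer != current:
                links.add(frozenset((current, peer)))
    # Drop ASes without their own aut-num record and private ASes.
    keep = {a for a in sources if a not in PRIVATE_ASES}
    return {link for link in links if link <= keep}
```

In this sketch a link survives only if both endpoints have their own aut-num record, which mirrors the exclusion of ASes external to the database.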

All four graphs constructed as described are available for
download from [15]. Overlap statistics of the graphs are
shown in Table 1.
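
Overlap statistics of this kind reduce to set intersections and differences. The following Python sketch shows one way to compute them; the function name `overlap_stats` and the dictionary keys are illustrative assumptions, and links are treated as undirected.

```python
def overlap_stats(nodes_a, edges_a, nodes_b, edges_b):
    """Count shared and exclusive nodes and edges of graphs Ga and Gb."""
    va, vb = set(nodes_a), set(nodes_b)
    # Store each undirected link as a frozenset so (u, v) equals (v, u).
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    return {
        "nodes_both": len(va & vb),
        "nodes_a_only": len(va - vb),
        "nodes_b_only": len(vb - va),
        "edges_both": len(ea & eb),
        "edges_a_only": len(ea - eb),
        "edges_b_only": len(eb - ea),
    }
```

For any pair of graphs, `nodes_both + nodes_a_only` equals the number of nodes in Ga, which gives a quick consistency check on a table of such counts.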

Comparing the two BGP-derived graphs, we note that the 
sets of their constituent nodes and links are similar. Given 
minor differences between node and link sets of the BGP 
table- and update-derived topologies, we, not surprisingly, 
found the metric values calculated for these two graphs to 
be close. Therefore, in the rest of this study we present 
characteristics of the static BGP-table graph only and refer 
to it as BGP graph. 2 

In constructing the skitter graph, we used BGP tables 
to map IP addresses observed in traceroutes to AS num- 
bers. Therefore the number of nodes seen by skitter but not 
by BGP should be 0. The one node difference (AS2277 
Ecuanet in skitter data) results from the fact that different 
BGP table dumps were used to construct the BGP-table 
graph and to map an IP address to this AS on the day when 
skitter observed this IP address in its traces. 

Based on the very method of their construction, the three 
graphs in this study reveal different sides of the actual Inter- 
net AS-level topology. The skitter graph closely reflects the 
topology of actual Internet traffic flows, i.e. the data plane. 
The BGP graph reveals the topology seen by the routing 
system, i.e. the control plane. However, both skitter and 
BGP are traceroute-like explorations of the network topol- 
ogy, meaning that we can try to approximate these graphs 
by a union of spanning trees rooted at, respectively, skitter 
monitors or BGP data collection points. As such, both these 

2 Plots and tables with metrics of the BGP-update graph included are
available in the Supplement [15].
Table 1: Comparison of graphs built from different data sources. The baseline graph Ga is the BGP-tables graph.
Graph Gb is one of the other graphs listed in the first row.

                                                     BGP updates   skitter     WHOIS
Number of nodes in both Ga and Gb (|Va ∩ Vb|)             17,349     9,203     5,583
Number of nodes in Ga but not in Gb (|Va \ Vb|)               97     8,243    11,863
Number of nodes in Gb but not in Ga (|Vb \ Va|)               68         1     1,902
Number of edges in both Ga and Gb (|Ea ∩ Eb|)             38,543    17,407    12,335
Number of edges in Ga but not in Gb (|Ea \ Eb|)            2,262    23,398    28,470
Number of edges in Gb but not in Ga (|Eb \ Ea|)            3,941    11,552    44,614

methods discover more radial links, that is, links connecting
numerous low-degree nodes (e.g. customer ASes) to
high-degree nodes (e.g. large ISP ASes). At the same time, 
these measurements fail to detect many tangential 3 links, 
that is, links between nodes of similar degrees. Traceroute- 
like methods are particularly unsuitable for discovering tan- 
gential links interconnecting medium-to-low degree nodes 
(e.g. lower-tier ASes) since many of these links do not lie on 
any shortest path rooted at a particular vantage point in the 
core. In contrast, WHOIS data contain abundant medium-degree
tangential links, since these links are directly attached to the
sources of WHOIS records (values of aut-num fields).

2.2 Statistical validity of our results 

Lakhina et al. [24] numerically explored sampling biases
arising from traceroute measurements and found that such
traceroute-sampled graphs of the Internet yield insufficient
evidence for characterizing the actual underlying Internet
topology. However, Dall'Asta et al. [25] convincingly refute
their conclusions by showing that various traceroute
exploration strategies provide sampled distributions with
enough signatures to distinguish at the statistical level between
different topologies. The authors of [25] also argue
that real mapping experiments observe genuine features of
the Internet, rather than artifacts. These results lend credibility
to our chosen traceroute-like data sources and imply
that the real Internet topology is unlikely to be critically different
from the ones measured in the skitter and BGP cases.

The topology metrics we consider in Section 3 all show
that the WHOIS topology is different from the other two
graphs. Thus, the following question arises: Can we explain
the difference by the fact that the WHOIS graph contains
only a part of the Internet, namely European ASes? To answer
this question we performed the following experiment.
We considered the BGP-tables and WHOIS topologies narrowed
to the set of nodes present both in BGP tables and
WHOIS (cf. Table 1) and compared the various topological
characteristics for the full and the reduced graphs. Results
of this comparison are available in the Supplement [15].

3 The semantics behind the terms "radial" and "tangential" come from
the skitter poster layout [23], where high-degree nodes populate the center
of a circle, while low-degree nodes are close to the circumference. Links
connecting high-degree nodes to low-degree nodes are indeed radial then.



We found that the induced graphs preserve the full set of 
the properly normalized topological properties of the orig- 
inal graphs. Therefore, the differences between full BGP 
and WHOIS topologies are intrinsic to their originating data 
sources, and not due to geographical biases in sampling the 
Internet. 
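
The narrowing step in this experiment is an induced-subgraph operation. A minimal Python sketch follows, assuming graphs are given as edge lists; the helper names `induced_subgraph` and `average_degree` are hypothetical, and average degree stands in here for any of the normalized properties being compared.

```python
def induced_subgraph(edges, keep_nodes):
    """Keep only the links whose two endpoints are both in keep_nodes."""
    keep = set(keep_nodes)
    return {frozenset((u, v)) for u, v in edges if u in keep and v in keep}

def average_degree(edge_set):
    """Average degree 2m/n over nodes that still have at least one link."""
    nodes = set().union(*edge_set) if edge_set else set()
    return 2 * len(edge_set) / len(nodes) if nodes else 0.0
```

Nodes left without any link after narrowing are not counted in this sketch, which is a simplifying assumption rather than a documented choice.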

3 Topology characteristics 

In this section, we quantitatively analyze differences be- 
tween the three graphs in terms of various topology met- 
rics. The set of metrics we discuss here encompasses most 
of the graph metrics considered relevant for topology in the 
networking literature 0] |5] |8] . Relative to most related 
work, we consider a broader array of metrics of interest. 

For each metric, we address the following points: 1) metric
definition; 2) metric importance; and 3) discussion of
the metric values for the three measured topologies. We
present these results in the plots associated with every metric
and in the master Table 3 containing all the scalar metric
values for all three graphs.

We start with simple and basic metrics that characterize 
local connectivity in a network. With increasing precision, 
we move on to more sophisticated metrics that describe 
global properties of the topology. The latter metrics play 
a vital role in the performance of network protocols and 
applications. Some metrics that we discuss here are not ex- 
actly equal but directly related to a topology characteristic 
deemed important in the networking literature. Where pos- 
sible, we illuminate the relationship between the metrics we 
consider and the ones that have been discussed in influential 
networking papers. We provide a summary of this mapping
in Table 2.



Table 2: Important metric mappings.

Previously defined metric                         Our definition
Likelihood in [4]                                 Assortativity coefficient
Expansion in [3]                                  Distance
Resilience in [3], Performance in [4]             Spectrum
Link value in [3], Router utilization in [4]      Betweenness

3.1 Average degree

Definition. The two most basic graph properties are the
number of nodes n (also referred to as graph size) and the
number of links m. They define the average node degree
$\bar{k} = 2m/n$.

Importance. Average degree is the coarsest connectivity 
characteristic of the topology. Networks with higher $\bar{k}$ are
"better-connected" on average and, consequently, are likely 
to be more robust. Detailed topology characterization based 
only on the average degree is rather limited, since graphs 
with the same average node degree can have vastly different 
structures. 

Discussion. BGP sees almost twice as many nodes as
skitter (Table 3). The WHOIS graph is the smallest, but its average
degree is almost three times larger than that of BGP,
and ~2.5 times larger than that of skitter. In other words,
WHOIS contains substantially more links, both in the absolute
(m) and relative ($\bar{k}$) senses, than any other data source,
but the credibility of these links is the lowest (cf. Section 2.1): there
have been reports that some ISPs tend to enter inaccurate
information in the WHOIS database in order to increase
their "importance" in the Internet hierarchy [21].

Graphs ordered by increasing average degree $\bar{k}$ are BGP,
skitter, WHOIS. We call this order the $\bar{k}$-order.

3.2 Degree distribution 

Definition. Let n(k) be the number of nodes of degree k
(k-degree nodes). The node degree distribution is
the probability that a randomly selected node is k-degree:
$P(k) = n(k)/n$. The degree distribution contains more information
about connectivity in a given graph than the average
degree, since given a specific form of P(k) we can
always restore the average degree by
$\bar{k} = \sum_{k=1}^{k_{max}} k P(k)$,
where $k_{max}$ is the maximum node degree in the graph. If
the degree distribution in a graph of size n is a power law,
$P(k) \sim k^{-\gamma}$, where $\gamma$ is a positive exponent, then P(k)
has a natural cut-off at the power-law maximum degree:
$k_{max}^{PL} = n^{1/(\gamma-1)}$.
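
These definitions translate directly into code. The Python sketch below is illustrative only: the function names are assumptions, and the graph is assumed to be a list of undirected links with no isolated nodes (a node appears only if it has at least one link).

```python
from collections import Counter

def degree_distribution(edges):
    """Return P(k) = n(k)/n as a dict mapping degree k to probability."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(deg)
    counts = Counter(deg.values())  # n(k): number of k-degree nodes
    return {k: c / n for k, c in counts.items()}

def average_degree(pk):
    """Restore the average degree from P(k) as the sum of k * P(k)."""
    return sum(k * p for k, p in pk.items())

def power_law_max_degree(n, gamma):
    """Natural cut-off n**(1/(gamma - 1)) for a power law with exponent gamma."""
    return n ** (1.0 / (gamma - 1.0))
```

For a star with three leaves, `degree_distribution` assigns probability 3/4 to degree 1 and 1/4 to degree 3, and `average_degree` returns 1.5, which agrees with 2m/n = 6/4.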

Importance. The degree distribution is the most frequently
used topology characteristic. The observation
that the Internet's degree distribution follows a power law had
a significant impact on network topology research: Internet
models before [1] failed to exhibit power laws. Since a power
law is a highly variable distribution, node degree is an important
attribute of an individual node. For example, we can
use AS degrees as the simplest way to rank ASes [26].

Discussion. As expected, the degree distribution PDFs
and CCDFs in Figure 1 are in the $\bar{k}$-order (BGP < skitter <
WHOIS) for a wide range of node degrees.

Comparing the observed maximum node degrees $k_{max}$
with those predicted by the power law $k_{max}^{PL}$ in Table 3, we
conclude that skitter is closest to a power law. The power-law
approximation for the BGP graph is less accurate. The
WHOIS graph has an excess of medium-degree nodes and
its node degree distribution does not follow a power law at
all. It is not surprising then that augmenting the BGP graph
with WHOIS links breaks the power-law characteristics of
the BGP graph [21, 22].

Note that there are fewer 1-degree nodes than 2-degree
nodes in all the graphs (cf. Figure 1(a)). This effect is
due to the AS number assignment policies [20] allowing
a customer to have an AS number only if it has multiple
providers. If these policies were strictly enforced, then the
minimum AS degree would be 2.

CCDFs of skitter and BGP graphs look rather similar
(Figure 1(b)), but Table 1 shows significant differences between
the two graphs, in terms of (non-)intersecting nodes
and links. We seek to answer the question of where, topologically,
these nodes and links are located. Calculating the
degree distribution of nodes present only in the BGP graph
(Figure 1(c)), we detect a skew towards low-degree nodes.
The average degree of the nodes that are present only in
the BGP graph, and not in skitter, is 1.86. Skitter's target list
of destinations to probe does not contain any replying IP address
in the address blocks advertised by these small ASes.
As a result, the skitter graph misses them.

Most links present only in BGP, but not in skitter, are
tangential links between low-degree ASes (see [15] for details).
The majority of such links connect the low-degree
ASes present only in BGP to their secondary (backup) low-degree
providers, while their primary providers are of high
degrees. Even if skitter detects a low-degree AS having 
such a small backup provider, this tool is still unlikely to 
detect the backup link since its traceroutes follow the pri- 
mary path via the large provider. 

3.3 Joint degree distribution 

While the node degree distribution tells us how many nodes 
of given degree are in the network, it fails to provide 
information on the interconnection between these nodes: 
given P(k), we still do not know anything about the structure
of the neighborhood of the average node of a given degree.
The joint degree distribution (or degree-degree correlation
matrix) fills this gap by providing information about
nodes' 1-hop neighborhoods.

Definition. Let $m(k_1, k_2)$ be the total number of
edges connecting nodes of degrees $k_1$ and $k_2$. The
joint degree distribution (JDD) is the probability that
a randomly selected edge connects $k_1$- and $k_2$-degree
nodes: $P(k_1, k_2) \sim m(k_1, k_2)/m$. 4 Note that $P(k_1, k_2)$
is different from the conditional probability
$P(k_2|k_1) = \bar{k} P(k_1, k_2) / (k_1 P(k_1))$
that a given $k_1$-degree node is connected
to a $k_2$-degree node. The JDD contains more information
about the connectivity in a graph than the degree
distribution, since given a specific form of $P(k_1, k_2)$ we can
4 The exact definition for undirected graphs differentiates (by a factor
1/2) between the $k_1 = k_2$ and $k_1 \neq k_2$ cases.

Figure 1: Node degree distributions P(k).



always restore both the degree distribution P(k) and average
degree $\bar{k}$ by expressions in [9]. The JDD is a function
of two arguments. A summary statistic of the JDD that is a
function of one argument is the average neighbor
connectivity $k_{nn}(k) = \sum_{k'=1}^{k_{max}} k' P(k'|k)$. It is simply the
average neighbor degree of the average k-degree node. It
shows whether ASes of a given degree preferentially connect
to high- or low-degree ASes. In a full mesh graph,
$k_{nn}(k)$ reaches its maximal possible value n - 1. Therefore,
for uniform graph comparison we plot normalized values
$k_{nn}(k)/(n - 1)$. We can further summarize the JDD by a
single scalar called the assortativity coefficient r [27, 28].
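
The three quantities defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: function names are hypothetical, the JDD is normalized over ordered degree pairs (each link counted once in each direction, one common convention), and the assortativity coefficient is computed as the Pearson correlation of the degrees found at the two ends of a link. The last function divides by zero on regular graphs, where all degrees are equal.

```python
from collections import Counter, defaultdict

def node_degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def joint_degree_distribution(edges):
    """P(k1, k2) over ordered degree pairs, each link counted both ways."""
    deg = node_degrees(edges)
    pairs = Counter()
    for u, v in edges:
        pairs[(deg[u], deg[v])] += 1
        pairs[(deg[v], deg[u])] += 1
    total = 2 * len(edges)
    return {kk: c / total for kk, c in pairs.items()}

def average_neighbor_connectivity(edges):
    """k_nn(k): mean degree of the neighbors of k-degree nodes."""
    deg = node_degrees(edges)
    total = defaultdict(float)  # sum of neighbor degrees seen from k-degree nodes
    ends = Counter()            # number of link ends attached to k-degree nodes
    for u, v in edges:
        total[deg[u]] += deg[v]
        ends[deg[u]] += 1
        total[deg[v]] += deg[u]
        ends[deg[v]] += 1
    return {k: total[k] / ends[k] for k in ends}

def assortativity(edges):
    """Pearson correlation of the degrees at either end of a link."""
    deg = node_degrees(edges)
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean * mean
    var = sum(x * x for x in xs) / n - mean * mean
    return cov / var
```

A star is the extreme disassortative case: every link joins the hub to a leaf, so the sketch returns r = -1 for it.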

Importance. As opposed to the degree distribution, the
JDD has only recently started to gain recognition in the
networking community [29, 6]. The most prominent recent example
defines likelihood [4], the central metric for their
argument, as a metric directly related to the assortativity
coefficient. The authors propose to use likelihood as a measure
of randomness differentiating between multiple graphs with
the same degree distribution. Such a measure is important
for evaluating the amount of order (e.g. engineering design
constraints) present in a given topology. A topology with
low likelihood is not random; it is a result of some sophisticated
evolution processes involving specific design purposes.
We actively use the JDD in the described fashion in
Section 4.

The assortativity coefficient r ($-1 \le r \le 1$) has direct
practical implications. Disassortative networks with r < 0
have an excess of radial links connecting nodes of dissimilar
degrees. Such networks are vulnerable to both random failures
and targeted attacks. Viruses spread faster in these
topologies. On the positive side, vertex cover in disassortative
graphs is smaller, which is important for applications
such as traffic monitoring [30] and prevention of DoS attacks
[31]. The opposite properties apply to assortative networks
with r > 0 that have an excess of tangential links
connecting nodes of similar degrees.



Discussion. All three Internet graphs built from our
data sources are disassortative (r < 0) as seen in Table 3.
We call the order of graphs with decreasing assortativity coefficient
r (WHOIS, BGP, skitter) the r-order. The most
disassortative graph is skitter, which has the largest excess
of radial links. The least disassortative graph is WHOIS.
The r-order can be explained in terms of differing topology
measurement methodologies. As described in Section 2, the
traceroute-like explorations of BGP and skitter data fail to
detect tangential links, thus causing the graphs to be disassortative.
The WHOIS graph's collection methodology,
however, finds abundant medium-degree tangential links, resulting
in the graph's higher assortativity.

The interplay between $\bar{k}$- and r-orders underlies Figure 2,
where we show the average neighbor connectivity functions
for the three graphs. Skitter has the largest excess of radial
links that connect low-degree nodes (customer ASes)
to high-degree nodes (large provider ASes). These radial
links are responsible for skitter's highest average degree
for the neighbors of low-degree nodes: in Figure 2,
skitter is at the top in the area of low degrees, which follows
the r-order. On the other hand, the greatest proportion
of tangential links between ASes of similar degrees in
the WHOIS graph contributes to connectivity of neighbors of
high-degree nodes; therefore the WHOIS graph is at the top
for high-degree nodes ($\bar{k}$-order).

Note that in the case of skitter and BGP, $k_{nn}(k)$ can be
approximated by a power law with the corresponding exponents
$\gamma_{nn}$ in Table 3.

3.4 Clustering 

While the JDD contains information about the degrees of
neighbors for the average k-degree node, it does not tell us
how these neighbors interconnect. Clustering satisfies this
need by providing a measure of how close a node's neighbors
are to forming a clique.

Figure 2: Normalized average neighbor connectivity $k_{nn}(k)/(n - 1)$.
Figure 3: Local clustering C(k).
Figure 4: Rich club connectivity $\phi(p/n)$.

Definition. Let $\bar{m}_{nn}(k)$ be the average number of links
between the neighbors of k-degree nodes. Local clustering
is the ratio of this number to the maximum possible number of such
links: $C(k) = 2\bar{m}_{nn}(k) / (k(k - 1))$. If two neighbors of a
node are connected, then these three nodes together form
a triangle (3-cycle). Therefore, by definition, local clustering
is the average number of 3-cycles involving k-degree
nodes. The two summary statistics associated with local
clustering are mean local clustering $\bar{C} = \sum_k C(k)P(k)$,
which is the average value of C(k), and the clustering coefficient
C, which is the percentage of 3-cycles among all
connected node triplets in the entire graph (for the exact definition,
see [32]).
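
A compact Python sketch of the three clustering quantities follows. It is an illustration only: the function name `clustering_stats` is hypothetical, nodes of degree below 2 are assigned local clustering 0 (a convention assumed here, since C(k) is undefined for them), and the clustering coefficient is computed as the fraction of connected node triplets that are closed.

```python
from collections import defaultdict
from itertools import combinations

def clustering_stats(edges):
    """Return (C(k) by degree, mean local clustering, clustering coefficient)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    per_degree = defaultdict(list)  # local clustering values grouped by degree
    closed = 0     # linked neighbor pairs, i.e. three per triangle
    triplets = 0   # all neighbor pairs, i.e. connected node triplets
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            per_degree[k].append(0.0)  # assumed convention for k < 2
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        pairs = k * (k - 1) // 2
        per_degree[k].append(links / pairs)
        closed += links
        triplets += pairs
    c_of_k = {k: sum(vals) / len(vals) for k, vals in per_degree.items()}
    mean_c = sum(sum(vals) for vals in per_degree.values()) / len(adj)
    coeff = closed / triplets if triplets else 0.0
    return c_of_k, mean_c, coeff
```

On a triangle with one pendant node attached, the mean local clustering is 7/12 while the clustering coefficient is 0.6, a small instance of the two summary statistics disagreeing.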

Importance. Similar to the JDD, one can use clustering
as a litmus test for verifying the accuracy of a topology
model or generator [5]. Clustering is a basic connectivity
characteristic. Therefore, if a model reproduces clustering
incorrectly, it is likely to be less accurate for a variety of
graph characteristics. We use clustering to verify the efficacy
of topology models in Section 4.

Clustering is practical because it expresses local robustness
in the graph: the higher the local clustering of a node,
the more interconnected are its neighbors, thus increasing
the path diversity locally around the node. Virus outbreaks
spread faster in highly clustered networks, although outbreak
sizes are smaller [33]. Networks with strong clustering are
likely to be chordal or of low chordality, 5 which makes certain
routing strategies perform better [34].

Discussion. We first observe that the mean clustering
values $\bar{C}$ in Table 3 are in the $\bar{k}$-order, which is expected:
the more links, the more clustering. The values of $\bar{C}$ are almost
equal for skitter and WHOIS, but the clustering coefficient
C is 15 times larger for WHOIS than for skitter. As
shown in [35], orders of magnitude difference between $\bar{C}$
and C is intrinsic to highly disassortative networks and is a
consequence of degree correlations (JDD).

Similarly to $k_{nn}(k)$, the interplay between $\bar{k}$- and r-orders
explains Figure 3, where we plot local clustering
as a function of node degree C(k). For low-degree nodes,


5 Chordality of a graph is the length of the longest cycle without chords.
A graph is called chordal if its chordality is 3.



skitter's clustering is the highest amongst the three graphs
because the skitter graph is the most disassortative. The links adjacent
to low-degree nodes are most likely to lead to high-degree
nodes, the latter being interconnected with a high
probability. For high-degree nodes, the WHOIS graph exhibits
the highest values of clustering since this graph has the
highest average connectivity (largest $\bar{k}$). The neighbors of
high-degree nodes are interconnected to a greater extent, resulting
in higher clustering for such nodes.

Similar to $k_{nn}(k)$, C(k) can also be approximated by a
power law for the skitter and BGP graphs (exponents $\gamma_C$ in
Table 3).

JDDs with strong correlations play a major part in the
presence of non-trivial clustering observed in many networks
[35]. This interplay explains the overall similarity between
degree correlations and clustering, in general, and the
similarity between $k_{nn}(k)$ and C(k), in particular.



3.5 Rich club connectivity 

Definition. Let p = 1 ... n be the first p nodes ordered by
their non-increasing degrees in a graph of size n. Rich club
connectivity (RCC) $\phi(p/n)$ is the ratio of the number of
links in the subgraph induced by the p largest-degree nodes
to the maximum possible number of links $p(p - 1)/2$. In other words,
the RCC is a measure of how close p-induced subgraphs are
to cliques.
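
The RCC definition can be sketched in a few lines of Python. The function name `rich_club_connectivity` is an assumption, the edge list is assumed to contain each undirected link once, and ties between equal-degree nodes are broken by order of first appearance, a detail the definition leaves open.

```python
from collections import Counter

def rich_club_connectivity(edges, p):
    """Link density of the subgraph induced by the p highest-degree nodes (p >= 2)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Sort nodes by non-increasing degree and keep the first p of them.
    club = set(sorted(deg, key=lambda node: -deg[node])[:p])
    links = sum(1 for u, v in edges if u in club and v in club)
    return 2.0 * links / (p * (p - 1))
```

Evaluating the function for every p from 2 to n and plotting the result against p/n yields a curve of the kind discussed in this section.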

Importance. As of this writing, one of the more successful
Internet AS-level topology models is the Positive Feedback
Preference (PFP) model by Zhou and Mondragon [8].
It accurately reproduces a wide spectrum of metrics of the
measured AS-level topology by trying to explicitly capture
only the following three characteristics: (i) the exact form
of the node degree distribution; (ii) the maximum node degree;
and (iii) RCC. The success of the PFP model in approximating
the real topology is yet to be fully explained.
One can show that networks with the same JDDs have the
same RCC. The converse is not true, but one can fully describe
all the JDDs having a given form of RCC.

Discussion. As expected, the values of $\phi(p/n)$ in Figure 4
are in the $\bar{k}$-order with WHOIS at the top: more links
result in denser cliques. RCC exhibits clean power laws
for all three graphs in the area of medium and large p/n.
The values of the power-law exponents $\gamma_{rc}$ in Table 3 result
from fitting $\phi(p/n)$ with power laws for 90% of the nodes,
$0.1 \le p/n \le 1$.

Figure 5: Average coreness of k-degree nodes $\kappa(k)$.

3.6 Coreness 

Definition. There are two definitions of coreness. In the graph-theoretic
literature [36], the k-core of a graph is the subgraph
obtained from the original graph after removal of all
nodes of degree less than or equal to k. A more informative
definition of k-core [7] is the subgraph obtained from
the original graph by the iterative removal of all nodes of
degree less than or equal to k. 6 We use the latter definition.
The node coreness $\kappa$ of a given node is then the maximum
k such that this node is still present in the k-core
but removed in the (k + 1)-core. The minimum node coreness
in a given graph is $\kappa_{min} = k_{min} - 1$, where $k_{min}$ is
the lowest node degree present. All 1-degree nodes have
$\kappa = 0$. The maximum node coreness $\kappa_{max}$ in a graph,
or the graph coreness, is such that the $\kappa_{max}$-core is not
empty, but the ($\kappa_{max} + 1$)-core is. For example, the coreness of
a tree is 0 and the coreness of a k-regular graph [37] is equal
to the coreness of all of its nodes (all having degree k), which
is k - 1. We further define the graph core as its $\kappa_{max}$-core,
and the graph fringe as the set of nodes with minimum
coreness $\kappa_{min}$. Note that because the process of building
the core is iterative, nodes with degree $k > \kappa_{min}$ can be in the
fringe.
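
The iterative definition adopted here can be sketched in Python as follows. The function name `node_coreness` is hypothetical and the graph is assumed to be an edge list without isolated nodes. With threshold t, the loop repeatedly strips nodes of degree at most t, which builds the t-core in the sense used above; a node stripped at threshold t was still present in the (t - 1)-core, so its coreness is t - 1.

```python
from collections import defaultdict

def node_coreness(edges):
    """Coreness of each node under iterative removal of nodes with degree <= k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    coreness = {}
    t = 1  # current removal threshold: strip nodes of degree <= t
    while alive:
        queue = [v for v in alive if deg[v] <= t]
        while queue:
            v = queue.pop()
            if v not in alive:
                continue  # already stripped via another neighbor
            alive.discard(v)
            coreness[v] = t - 1
            for w in adj[v]:
                if w in alive:
                    deg[w] -= 1  # removal lowers the degree of remaining neighbors
                    if deg[w] <= t:
                        queue.append(w)
        t += 1
    return coreness
```

The graph coreness is then `max(coreness.values())`. Under this definition the value is one less than the core number used in much of the graph-theoretic literature, which is why all nodes of a star, including its hub, get coreness 0.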

Importance. The node coreness tells us how "deep in the 
core" the node is. It is a much more sophisticated measure 
of node connectivity than node degree. Indeed, the node 
degree can be high, but if its coreness is small, then the 
node is not well connected and one can easily disconnect 

6 Remove all nodes of degree $\le k$, then do it again in the remaining
graph, and proceed until all remaining nodes are of degree $> k$.



it by removing its poorly connected neighbors. For example,
a high-degree hub of a star has coreness of 0. At the
same time, node coreness is not a measure of centrality of
the node. For example, a low-degree node interconnecting
a few high-degree hubs has a low value of coreness, but
intuitively it is in the "center of the graph." Nevertheless,
coreness is important for topology visualization capable of
revealing network architectural fingerprints [38] and
signatures of topology dynamics under different types of
anomalies (worm and DoS attacks, outages,
misconfigurations, etc.) [7].

Discussion. The average node coreness in Table 3 is
in the $\bar{k}$-order, which is expected. The graph coreness of
WHOIS is more than three times larger than that of skitter and
BGP. WHOIS has a particularly large core size and graph
coreness because the $r$-order amplifies the $\bar{k}$-order in
this case: WHOIS has the highest link density (largest
$\bar{k}$) and the highest concentration of links in the core
(largest $r$). The WHOIS graph has the largest relative core size
and the smallest relative fringe size (cf. Table 3). The BGP
graph is the sparsest, having the smallest relative core size
and the largest relative fringe size. Interestingly, in the BGP
graph, nodes with degree as low as 34 are in the core, and nodes
with degree as high as 7 are in the fringe. For all three graphs,
the average node coreness as a function of node degree
$\kappa(k)$ roughly follows power laws for $k < 100$ (Figure 5).
The corresponding exponents and mean coreness are given in
Table 3. For nodes with degrees $k > 100$ the coreness reaches
saturation: increasing node degree above 100 does not increase
coreness.

3.7 Distance 

Definition. The shortest path length distribution or simply
the distance distribution $d(x)$ is the probability for a
random pair of nodes to be at a distance of $x$ hops from each
other. Two basic summary statistics associated with the
distance distribution of a graph are the average distance
$\bar{d}$ and the standard deviation $\sigma$. We call the
latter the distance distribution width since the distance
distribution in Internet graphs (and in many other networks)
has a characteristic Gaussian-like shape.

Eccentricity is an extreme form of distance: if $d_{ij}$ is the
distance between nodes $i$ and $j$, then the eccentricity
$\varepsilon_i$ of node $i$ is the maximum distance from $i$
[37]: $\varepsilon_i = \max_j d_{ij}$. The maximum eccentricity
in a graph is also the maximum distance and is called the graph
diameter $D = \varepsilon_{max}$, and the minimum eccentricity
$R = \varepsilon_{min}$ is called the graph radius. The set of
nodes with maximum eccentricity forms the graph periphery, while
nodes with minimum eccentricity belong to the graph center [37].
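
As a worked illustration of these definitions, the sketch below computes the distance distribution, its mean and width, and the eccentricity-based quantities by running a breadth-first search from every node. It is a minimal example under stated assumptions: the graph is connected, unweighted, and stored as a dict of neighbour sets, and the names `bfs_distances` and `distance_statistics` are hypothetical, not taken from the scripts released with this paper.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_statistics(adj):
    """Distance distribution and eccentricity summary of a connected graph."""
    histogram = Counter()  # number of ordered node pairs at each distance
    ecc = {}               # eccentricity of each node
    for s in adj:
        dist = bfs_distances(adj, s)
        ecc[s] = max(dist.values())
        for t, x in dist.items():
            if t != s:
                histogram[x] += 1
    total = sum(histogram.values())
    d = {x: count / total for x, count in histogram.items()}
    mean = sum(x * p for x, p in d.items())
    width = sum((x - mean) ** 2 * p for x, p in d.items()) ** 0.5
    diameter, radius = max(ecc.values()), min(ecc.values())
    return {
        "d": d,
        "mean": mean,
        "width": width,
        "eccentricity": ecc,
        "diameter": diameter,
        "radius": radius,
        "center": {v for v, e in ecc.items() if e == radius},
        "periphery": {v for v, e in ecc.items() if e == diameter},
    }


# A 5-node path a-b-c-d-e: diameter 4, radius 2, center {c}, periphery {a, e}.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
stats = distance_statistics(path)
```

Running one search per node costs on the order of $n$ times the number of links, which is adequate for a sketch but would need care for graphs with many thousands of nodes.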

Importance. Distance distribution is critically important
for many applications, the most prominent being routing.
A distance-based locality-sensitive approach [39] is the root
of most modern routing algorithms. As shown in [40],
performance parameters of these algorithms depend strongly
on the distance distribution in a network. In particular, short
average distance and narrow distance distribution width
break the efficiency of traditional hierarchical routing. They
are among the root causes of interdomain routing scalability
issues in the Internet today.

Distance distribution also plays a vital role in the robustness
of the network to viruses. Viruses can quickly contaminate
large portions of a network that has small distances between
nodes. Topology models that accurately reproduce the
observed distance distribution will benefit researchers who
are developing techniques to quarantine the network from
viruses. Finally, expansion, identified in the seminal paper [3]
as a critical metric for topology comparison analysis, is a
renormalized version of distance distribution.

Discussion. Interestingly enough, although the distance
distribution is a "global" topology characteristic, we can
explain Figure 6 by the interplay between our local
connectivity characteristics: the $\bar{k}$- and $r$-orders.
First, we note that the skitter graph stands out in Figure 6 as
it has the smallest average distance and the smallest
distribution width (cf. Table 3). This result appears
unexpected at first since the skitter graph has more nodes than
the WHOIS graph and only about half the number of links. One
would expect a denser graph (WHOIS) to have a lower average
distance since adding links to a graph can only decrease the
average distance in it. Surprisingly, the average distance of
the most richly connected (highest $\bar{k}$) WHOIS graph is not
the lowest. This result can be explained using the $r$-order.
Indeed, a more disassortative graph has a greater proportion
of radial links, shortening the distance from the fringe to the
core.$^7$ The skitter graph has the right balance between the
relative number of links $\bar{k}$ and their radiality $r$ that
minimizes the average distance. Compared to skitter, the BGP
graph has a larger distance because it is sparser (lower
$\bar{k}$), and the WHOIS graph has a larger distance because it
is more assortative (higher $r$).

The fact that 62% of AS paths in the skitter graph are
3-hop paths suggests the most frequent path pattern
reflecting the customer-provider AS hierarchy: source's AS in
the fringe → source's provider AS in the core → destination's
provider AS in the core → destination's AS in the fringe.

Another important observation is that for all three graphs,
including WHOIS, the average distance as a function of
node degree exhibits relatively stable power laws in the full
range of node degrees (Figure 7), with exponents given in
Table 3.

7 Henceforth, we use the terms fringe and core to mean zones in the
graph with low- and high-degree nodes, respectively.

Both the eccentricity distribution $\varepsilon(x)$ (Figure 9)
and the average eccentricity of $k$-degree nodes
$\varepsilon(k)$ (Figure 10) are similar to their averaged
distance counterparts. Table 3 also shows the diameter, radius,
and average eccentricity for our graphs, as well as the relative
size of the graph center and periphery. In the WHOIS graph, the
center consists of only one AS, AS702 (UUNET), uniquely
positioned to have the minimum eccentricity of 4. If we add the
nodes having eccentricity of 5, the center would consist of
1109 ASs, the center size ratio of 0.15 would become the largest
among all three graphs, and it would be in the expected
$\bar{k}$-order.

3.8 Betweenness 

Average distance is a good node centrality measure: in- 
tuitively, nodes with smaller average distances are closer 
to the graph "center." However, the most commonly used 
measure of centrality is betweenness. It is applicable not 
only to nodes, but also to links. 

Definition. Let $\sigma_{ij}$ be the number of shortest paths
between nodes $i$ and $j$, and let $l$ be either a node or a
link. Let $\sigma_{ij}(l)$ be the number of shortest paths
between $i$ and $j$ going through node (or link) $l$. Its
betweenness is $B_l = \sum_{ij} \sigma_{ij}(l)/\sigma_{ij}$. The
maximum possible value for node and link betweenness is
$n(n-1)$ [25]; therefore, in order to compare betweenness in
graphs of different sizes, we normalize it by $n(n-1)$.
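
To make the definition concrete, the following sketch accumulates the fractions $\sigma_{ij}(l)/\sigma_{ij}$ for nodes and links with one breadth-first search per source node, in the style of Brandes' dependency accumulation. It is not the implementation used for this paper. The function name `betweenness`, the dict-of-sets input, and the choice to count each unordered pair $\{i, j\}$ once and to exclude path endpoints from node betweenness are assumptions made for this example; a sum over ordered pairs would simply double every value.

```python
from collections import deque


def betweenness(adj):
    """Non-normalized node and link betweenness of an unweighted graph.

    adj: dict node -> set of neighbours (undirected simple graph).
    Returns (node_b, link_b); links are keyed by frozenset({u, v}).
    """
    node_b = {v: 0.0 for v in adj}
    link_b = {}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions from the farthest nodes to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                link = frozenset((v, w))
                link_b[link] = link_b.get(link, 0.0) + share
                delta[v] += share
            if w != s:
                node_b[w] += delta[w]
    # Every unordered pair was visited from both endpoints: halve.
    node_b = {v: b / 2.0 for v, b in node_b.items()}
    link_b = {e: b / 2.0 for e, b in link_b.items()}
    return node_b, link_b


# Star with n = 5 nodes: each link touches a 1-degree node, so its
# non-normalized betweenness is n - 1 = 4; the hub carries all 6 leaf pairs.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
star_nodes, star_links = betweenness(star)
```

Dividing the returned values by $n(n-1)$ gives the normalization described above.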

Importance. Betweenness measures the number of shortest
paths passing through a node or link and, thus, estimates
the potential traffic load on this node/link assuming
uniformly distributed traffic following shortest paths.$^8$
Betweenness is important for traffic engineering applications
that try to estimate potential traffic load on nodes/links and
potential congestion points in a given topology. Betweenness
is also critical for evaluating the accuracy of topology
sampling by traceroute-like probes (e.g. skitter and BGP).
As shown in [25], the broader the betweenness distribution,
the higher the statistical accuracy of the sampled graph. The
exploration process statistically focuses on nodes/links with
high betweenness, thus providing an accurate sampling of
the distribution tail and capturing relevant statistical
information. Finally, we note that link value, used in [3] to
analyze the topology hierarchy, and router utilization, used in
[4] to measure network performance, are both directly related
to betweenness.

Discussion. The simplest approach to calculating node
betweenness results in long running times, so we used an
efficient algorithm from [41], which we modified to compute
link betweenness as well. For skitter and BGP graphs,
node betweenness is a growing power-law function of node
degree (Figure 8), with exponents given in Table 3. The
WHOIS graph has an excess of medium-degree nodes
(cf. its node degree distribution), leading to greater path
diversity and, hence, to lower betweenness values for these
nodes. We also calculate average link betweenness as a function
of the degrees of nodes adjacent to a link, $B(k_1, k_2)$
(Figure 11). Contrary to popular belief, the contour plots show
that link betweenness does not measure link centrality. First,
betweenness of links adjacent to low-degree nodes (the left and
bottom sides of the plots) is not the minimum. In fact,
non-normalized betweenness of links adjacent to 1-degree nodes
is constant and equal to $n - 1$ (the number of destinations in
the rest of the network). Similar values of betweenness
characterize links elsewhere in the graph, including radial
links between high and low-to-medium degree nodes and tangential
links in the zone of medium-to-high degrees (diagonal zone from
bottom-right to upper-left). Second, while the
maximum-betweenness links are between high-degree nodes as
expected (the upper right corner of the plots), the
minimum-betweenness links are tangential in the medium-to-low
degree zone (diagonal areas of low values from bottom-left to
upper-right). We can explain the latter observation by the
following argument. Let $i$ and $j$ be two nodes connected by
a minimum-betweenness link $l$. The only shortest paths going
through $l$ are those between nodes that are below $i$ and $j$,
where "below" means further from the core and closer to the
fringe. When the degrees of both $i$ and $j$ are small, the
numbers of nodes below them (with lower degree) are small,
too. Consequently, the number of shortest paths, proportional
to the product of the number of nodes below $i$ and $j$,
attains its minimum at $l$. We conclude that link betweenness
is not a measure of centrality but a measure of some
combination of link centrality and radiality.

8 In fact, some variants of betweenness are just called load [41].

3.9 Spectrum 

Definition. Let $A$ be the adjacency matrix of a graph.
This $n \times n$ matrix is constructed by setting the value of
its elements $a_{ij} = a_{ji} = 1$ if there is a link between
nodes $i$ and $j$. All other elements have value 0. Scalar
$\lambda$ and vector $v$ are the eigenvalue and eigenvector,
respectively, of $A$ if $Av = \lambda v$. The spectrum of a graph
is the set of eigenvalues of its adjacency matrix.
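
For readers who want to experiment with the definition, the sketch below estimates the largest adjacency eigenvalue by power iteration, using only the Python standard library. It is an illustrative technique chosen for this example, not the method used to compute the spectra reported here; the function name `largest_eigenvalue` and the dict-of-sets graph representation are assumptions. The iteration is applied to $A + cI$ with $c$ equal to the maximum degree, so that bipartite graphs, whose spectrum is symmetric around zero, still converge.

```python
def largest_eigenvalue(adj, iterations=500):
    """Estimate the largest eigenvalue of the adjacency matrix.

    adj: dict node -> set of neighbours (undirected simple graph with
         at least one link; hypothetical representation).
    """
    nodes = list(adj)
    shift = max(len(adj[u]) for u in nodes)  # c = maximum degree
    x = {u: 1.0 for u in nodes}              # positive starting vector
    for _ in range(iterations):
        # y = (A + c I) x, then renormalize to unit length.
        y = {u: shift * x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = sum(val * val for val in y.values()) ** 0.5
        x = {u: val / norm for u, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[u] * sum(x[v] for v in adj[u]) for u in nodes)


# Complete graph on 4 nodes: largest eigenvalue n - 1 = 3.
k4 = {u: {v for v in range(4) if v != u} for u in range(4)}
# Star with 4 leaves: largest eigenvalue sqrt(4) = 2.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
```

The two examples illustrate the standard bounds for this quantity: the largest eigenvalue lies between the average and the maximum node degree, so denser graphs tend to have larger first eigenvalues.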

Importance. We stress that spectrum is one of the most
important global characteristics of the topology. Spectrum
yields tight bounds for a wide range of critical graph
characteristics [42], such as distance-related parameters,
expansion properties, and values related to separator problems
estimating graph resilience under node/link removal. The
largest eigenvalues are particularly important. Most networks
with high largest eigenvalues have small diameter,
expand faster, and are more robust. To further emphasize
the importance of spectrum, we consider the following two
specific examples of spectrum-related metrics that played a
central role in two significant contributions to networking
topology research.

First, Tangmunarunkit et al. [3] defined network resilience,
one of the three metrics critical for their topology
comparison analysis, as a measure of network robustness
under link removal, which equals the minimum balanced
cut size of a graph. By this definition, resilience is related
to spectrum since the graph's largest eigenvalues provide
bounds on network robustness with respect to both link and
node removals [42].

Second, Li et al. [4] define network performance, one of
the two metrics critical for their HOT argument, as the
maximum traffic throughput of the network. By this definition,
performance is related to spectrum since it is essentially the
network conductance [43]. It can be tightly estimated by the
gap between the first and second largest eigenvalues [42].

Beyond its significance for network robustness and
performance, the graph's largest eigenvalues are important for
traffic engineering purposes since graphs with larger
eigenvalues have, in general, more node- and link-disjoint paths
to choose from. The spectral analysis of graphs is also a
powerful tool for detailed investigation of network structure
[44, 45], such as discovering clusters of highly
interconnected nodes, and can reveal the hierarchy of ASes in
the Internet [3].

Figure 11: Logarithm of normalized link betweenness
$B(k_1, k_2)/n/(n-1)$ on a log-log scale.

Figure 12: Spectrum. Absolute values of top 10% of eigenvalues
ordered by their normalized rank: the absolute value divided by
the total number of eigenvalues calculated for a given graph.

Discussion. Our $\bar{k}$-order (BGP, skitter, WHOIS) plays a
key role once again: the densest graph, WHOIS, is on the
top in Figure 12 and its first eigenvalue is the largest in
Table 3. The eigenvalue distributions of all three graphs
follow power laws.

Other important metrics such as coreness and eccentricity
are explained in detail in the Supplement [15]. As with
other metrics, the resulting metric values and differences in
the three data sources can be explained using the
$\bar{k}$-order and $r$-order.



4 Observed topologies vs. random graph 
models 

So far we have looked at metrics that provide important de- 
tails about the Internet AS-graph. These metrics directly 
impact network applications and protocols, and can also 
be used to distinguish between different topologies. Using
JDD, which determines both the $\bar{k}$-order and $r$-order, we
have been able to account for the differences and
peculiarities in our target data sets. We next consider models that
aim to reproduce observed topologies. In this section, we 
consider different classes of random graphs and discuss the 
relationship between these theoretical models and the Inter- 
net graphs we constructed from measurements. This anal- 
ysis will help determine how close random graph models 
come to capturing measured Internet topologies. 

4.1 Random graph models 

Topology generators and models have been evolving
steadily in the past few years. The simplest model mimicked
the average degree observed in the topology. Given
the number of nodes $n$ and edges $m$ (e.g. in the original
graph), and, consequently, the average degree
$\bar{k} = 2m/n$, one can construct the class of maximally
random graphs having the same average degree $\bar{k}$ by
connecting every pair of nodes with probability
$p = \bar{k}/n$. These graphs belong to the class of classical
(Erdős-Rényi) random graphs $G_{n,p}$ [46]. In this paper we
call such graphs 0K-random, conceptualizing them as a
zero-order approximation to the connectivity in the original
graph. (We explain the exact semantics behind this terminology
at the end of this section.) In general, 0K-random graphs fail
to approximate real Internet topologies. In particular, the
node degree distribution in 0K-random graphs is binomial,
which is closely approximated by the Poisson distribution
$P_{0K}(k) = e^{-\bar{k}} \bar{k}^k / k!$. It is different from
the power-law degree distributions observed in the Internet.
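
The 0K-random construction is simple enough to sketch directly. The Python below is an illustration under stated assumptions, not a generator used in this paper: the names `zero_k_random_graph` and `poisson_pmf`, the node count, and the target average degree are all hypothetical choices for the example. It connects every pair of nodes independently with probability $p = \bar{k}/n$ and provides the Poisson approximation for comparison with the resulting degree histogram.

```python
import math
import random
from collections import Counter


def zero_k_random_graph(n, mean_degree, rng):
    """Connect each of the n(n-1)/2 node pairs with probability p = mean_degree / n."""
    p = mean_degree / n
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def poisson_pmf(k, mean_degree):
    """Poisson approximation to the binomial degree distribution."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


# Illustrative parameters only (not the sizes of the measured graphs).
rng = random.Random(1)
graph = zero_k_random_graph(400, 6.0, rng)
degree_counts = Counter(len(neigh) for neigh in graph.values())
observed_mean = sum(len(neigh) for neigh in graph.values()) / len(graph)
```

Comparing `degree_counts[k] / 400` with `poisson_pmf(k, 6.0)` shows a distribution concentrated around the mean, with none of the high-degree hubs that a power-law degree distribution produces.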

The next model remedied this deficiency by capturing the
degree distribution of the nodes. Given a specific form of
the degree distribution $P(k)$ (e.g. extracted from the
original graph), one can construct the class of maximally random
graphs having the same degree distribution following, for
example, a recipe introduced in [47, 48] and further
formalized in [49]. We call such graphs 1K-random, and we can
think of them as providing the first-order approximation to
the connectivity of the original graph. Of particular interest
for Internet modeling is the case when $P(k)$ is a power-law
function. The resulting sub-class of 1K-random graphs is called
power-law random graphs (PLRG). Note that the 1K-random graphs
have a specific form of the JDD $P(k_1, k_2)$. If we denote by
$\tilde{P}(k)$ the probability that one of the two nodes
adjacent to a randomly selected edge is of degree $k$,
$\tilde{P}(k) = (k/\bar{k})P(k)$, then the JDD in 1K-random
graphs is $P_{1K}(k_1, k_2) = \tilde{P}(k_1)\tilde{P}(k_2)$,
meaning that there is no correlation between degrees of adjacent
nodes. This is why 1K-random graphs are also called
uncorrelated graphs. By construction [46], 0K-random graphs
are also uncorrelated, with their JDD $P_{0K}(k_1, k_2)$ given
by the same expression as above, where $P(k)$ is the Poisson
distribution $P_{0K}(k)$.
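
One common way to realize a random graph with a prescribed degree sequence is stub matching (the configuration model): give each node as many "stubs" as its degree and pair the stubs uniformly at random. The sketch below uses that technique as an assumption for illustration; it is not necessarily the recipe of [47, 48, 49], and it may produce self-loops and repeated links, which a real generator would have to handle. The function names and the example degree sequence are hypothetical. The second function evaluates the edge-end distribution $\tilde{P}(k) = (k/\bar{k})P(k)$, and the third the uncorrelated JDD built from it.

```python
import random
from collections import Counter


def one_k_random_multigraph(degree_sequence, rng):
    """Pair degree 'stubs' uniformly at random (may yield loops/multi-links)."""
    stubs = [node for node, k in enumerate(degree_sequence) for _ in range(k)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


def edge_end_distribution(degree_sequence):
    """P~(k) = (k / kbar) P(k): degree distribution seen from a random link end."""
    n = len(degree_sequence)
    kbar = sum(degree_sequence) / n
    counts = Counter(degree_sequence)
    return {k: k * (c / n) / kbar for k, c in counts.items() if k > 0}


def jdd_1k(k1, k2, p_tilde):
    """Uncorrelated JDD: P_1K(k1, k2) = P~(k1) * P~(k2)."""
    return p_tilde.get(k1, 0.0) * p_tilde.get(k2, 0.0)


# Hypothetical toy degree sequence, for illustration only.
sequence = [3, 3, 2, 2, 1, 1]
links = one_k_random_multigraph(sequence, random.Random(7))
p_tilde = edge_end_distribution(sequence)
```

For this sequence $\bar{k} = 2$, so $\tilde{P}(3) = 1/2$, $\tilde{P}(2) = 1/3$, and $\tilde{P}(1) = 1/6$: link ends are biased toward high-degree nodes even though only a third of the nodes have degree 3.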

We now define a model that provides the next level
of approximation: 2K-random graphs, which are maximally
random graphs reproducing the given JDD $P(k_1, k_2)$.
These graphs have the exact JDD of the original topology,
but are random in all other respects. The semantics behind
the "dK-random" notation becomes clear now: $d$ in
"dK-random" is the number of arguments in the degree
distribution function $P(k_1, k_2, \ldots, k_d)$ that the
dK-random graphs reproduce.
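
One possible way to randomize a graph while keeping its JDD fixed is a restricted link swap: pick two links $(a, b)$ and $(c, d)$ with $\deg(a) = \deg(c)$ and replace them by $(a, d)$ and $(c, b)$. Every node keeps its degree, and the two degree pairs attached to the links are merely exchanged, so the JDD is unchanged. The sketch below illustrates this idea only; it is an assumption introduced for this example, not the construction used in this paper, and it makes no claim about how many swaps are needed to reach a maximally random graph. All function names and the toy graph are hypothetical.

```python
import random
from collections import Counter


def degrees(links):
    deg = Counter()
    for u, v in links:
        deg[u] += 1
        deg[v] += 1
    return deg


def jdd_counts(links):
    """Number of links joining each unordered pair of node degrees."""
    deg = degrees(links)
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in links)


def rewire_preserving_jdd(links, attempts, rng):
    """Attempt JDD-preserving swaps on a simple graph given as a link list."""
    links = [tuple(e) for e in links]
    present = {frozenset(e) for e in links}
    deg = degrees(links)
    for _ in range(attempts):
        i, j = rng.randrange(len(links)), rng.randrange(len(links))
        if i == j:
            continue
        a, b = links[i]
        c, d = links[j]
        # Choose a random orientation for each link.
        if rng.random() < 0.5:
            a, b = b, a
        if rng.random() < 0.5:
            c, d = d, c
        if deg[a] != deg[c]:
            continue  # the swap would change the JDD
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # the swap would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        links[i], links[j] = (a, d), (c, b)
    return links


# Toy graph: a 6-cycle with two chords (illustrative only).
original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]
rewired = rewire_preserving_jdd(original, 200, random.Random(3))
```

Comparing `jdd_counts(original)` with `jdd_counts(rewired)` confirms that the swap leaves the JDD untouched, whatever links end up being exchanged.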

4.2 Comparison with observed topologies 

As demonstrated in [3], 1K-random graphs generated by
PLRG-based topology generators approximate the Internet
topology more accurately than the outputs of older topology
generators designed to simulate the perceived hierarchical
structure of the Internet. We show that the topology generation
strategy based on modeling only the degree distribution fails
to attain the level of accuracy required in the description of
Internet topology. Li et al. [4] have shown that graphs with
the same degree distribution can have different structures. In
Section 4.2.1 we compare the JDD of 1K-random graphs to the JDD
observed in the measured data and show how they differ. As
a next step, we also show how 2K-random graph models
better approximate the real topologies.

4.2.1 Joint degree distribution 

For each of our graphs, we consider its 1K-random
counterpart reproducing $P(k)$ of the given graph. We calculate
the JDD of the model and compare it with the actual JDDs
of our graphs (Figure 13).

The 1K-random graph generated from skitter's node degree
distribution (Figure 13(a)) has the smallest frequency
of tangential links interconnecting medium-degree nodes
(the minimum in the center of the plot). The most frequent
links are either radial (bottom-right and top-left corners) or
low-degree tangential (bottom-left corner). The ratio of the
actual JDD of the skitter graph to this model (Figure 13(b))
shows that the real skitter topology is quite different from
its 1K-random version. The actual skitter graph exhibits
a relative deficiency of links in the core and in the fringe
(minimum of the ratio in the top-right and bottom-left
corners). At the same time, it has a relative excess of radial
links (bottom-right and top-left corners) and of tangential
links in the medium-degree zone (the center of the plot).

The ratio of the BGP graph JDD to its 1K-random
counterpart is similar to the skitter ratio, but the excess of
radial links is less prominent (Figure 13(c)). The ratio of the
WHOIS graph JDD to its 1K-random model is less variable
(Figure 13(d)), showing that the WHOIS graph is closer to being
1K-random than the other two graphs.

We now turn our attention to other JDD-derived statistics
(cf. Section 3.3). The assortativity coefficient of
uncorrelated 1K- and 0K-random graphs is $r = 0$, and
their average neighbor connectivity $k_{nn}(k)$ is a constant
function of node degree $k$. For 1K-random graphs, it
is $k_{nn}(k) = \langle k^2 \rangle / \bar{k}$, where
$\langle k^2 \rangle$ denotes the second moment of the degree
distribution. For 0K-random graphs, the expression is
$k_{nn}^{0K}(k) = \bar{k} + 1$. While all three of our data
sources yield disassortative graphs with $r < 0$, the
assortativity coefficient of the WHOIS graph is closest to 0
(cf. Table 3). Its average neighbor degree $k_{nn}(k)$ varies
within a factor of 2. In contrast, the average neighbor degree
of the other two graphs varies by two orders of magnitude
(cf. Figure 2). These observations again point out that the
WHOIS graph is the closest to being 1K-random. Note, however,
that PLRG-generated graphs [9, 3] cannot accurately
approximate the WHOIS topology since its degree distribution
does not follow a power law.

The skitter graph is on the other extreme: it is the
most disassortative (the smallest value of $r$) and its average
neighbor degree $k_{nn}(k)$ has the sharpest decline (the
largest value of the exponent $\gamma_{nn}$ of the power-law fit
of $k_{nn}(k)$). In other words, even though this graph has a
power-law degree distribution, the 1K-random (PLRG) model
cannot accurately approximate it either.

Figure 13: Comparison of graphs with 1K-random models. a) The
contour plot of the logarithm of the joint degree distribution
$P_{1K}(k_1, k_2)$ for a 1K-random graph having the skitter
degree distribution $P(k)$. b) The logarithm of the ratio of
$P(k_1, k_2)$ observed in the real skitter graph to its
simulated $P_{1K}(k_1, k_2)$. c, d) The plots, analogous to (b),
for BGP and WHOIS graphs. Some asymmetry of the diagrams is due
to interpolation and rounding algorithms in MATLAB. The scatter
plots in the Supplement [15] are symmetric.
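
The constant neighbor connectivity of uncorrelated graphs follows in one line from the edge-end distribution $\tilde{P}(k) = (k/\bar{k})P(k)$ of Section 4.1. In the sketch below, the conditional probability $P(k' \mid k)$ that a neighbor of a $k$-degree node has degree $k'$ is notation introduced here for the derivation; in an uncorrelated graph it does not depend on $k$ and equals $\tilde{P}(k')$. The second line uses the standard second moment of the Poisson distribution.

```latex
\begin{align}
k_{nn}(k) &= \sum_{k'} k' \, P(k' \mid k)
           = \sum_{k'} k' \, \tilde{P}(k')
           = \frac{1}{\bar{k}} \sum_{k'} k'^2 P(k')
           = \frac{\langle k^2 \rangle}{\bar{k}}, \\
\text{Poisson } P(k):\quad
\langle k^2 \rangle &= \bar{k}^2 + \bar{k}
\;\Longrightarrow\;
k_{nn}(k) = \bar{k} + 1 .
\end{align}
```

Neither expression depends on $k$, which is why a measured $k_{nn}(k)$ that varies over orders of magnitude signals a graph far from being 1K-random.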

4.2.2 Clustering 

In this section, we focus on how clustering can be used
to verify the accuracy of topology models. Uncorrelated
graphs have not only constant average neighbor connectivity
but also constant clustering. For 1K-random graphs, it is
$C_{1K} = (\langle k^2 \rangle - \bar{k})^2 / (n \bar{k}^3)$,
while for 0K-random graphs, we have $C_{0K} = \bar{k}/n$ [9].
Dorogovtsev [50] showed that the 2K-random graphs have a
specific form of local clustering $C_{2K}(k)$ and derived
expressions for mean local clustering $\bar{C}_{2K}$ and
clustering coefficient $C_{2K}$ (Eqs. (8), (9), and (10) in
[50], correspondingly).
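
The two constant clustering values depend only on the number of nodes and the first two moments of the degree distribution, so they are easy to evaluate from a degree sequence. The sketch below does so; the function names and the example sequences are hypothetical and chosen only to illustrate the formulas, not taken from the measured graphs.

```python
def clustering_1k(degree_sequence):
    """C_1K = (<k^2> - kbar)^2 / (n * kbar^3) for an uncorrelated graph."""
    n = len(degree_sequence)
    kbar = sum(degree_sequence) / n
    k2 = sum(k * k for k in degree_sequence) / n
    return (k2 - kbar) ** 2 / (n * kbar ** 3)


def clustering_0k(degree_sequence):
    """C_0K = kbar / n for a classical random graph of the same density."""
    n = len(degree_sequence)
    return (sum(degree_sequence) / n) / n


# A 4-regular sequence on 10 nodes: C_1K = (16 - 4)^2 / (10 * 64) = 0.225.
regular = [4] * 10
# A skewed toy sequence with a few hubs: the second moment dominates,
# so C_1K is much larger than C_0K.
skewed = [1] * 96 + [30] * 4
ratio = clustering_0k(skewed) / clustering_1k(skewed)
```

For a Poisson degree distribution $\langle k^2 \rangle - \bar{k} = \bar{k}^2$, so the first formula reduces to the second. Note also that for strongly skewed sequences the uncorrelated estimate can exceed 1, which is a known limitation of the formula in small or very heterogeneous graphs.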

We compare clustering observed in our three Internet
graphs with the predicted values for different graph models
(Figure 14). In the skitter and BGP cases, the local
clustering function $C_{2K}(k)$ calculated for the 2K-random
model follows, albeit shifted down, the form of the actually
observed clustering $C(k)$. The ratio of the corresponding mean
values $\bar{C}_{2K}/\bar{C}$ is 0.8 for the skitter graph and
0.7 for the BGP graph. In the WHOIS case, the functional
behavior of the model and of the observed clustering are
different, and the ratio of their mean values is 0.25. We
conclude that, using the metric of clustering, the skitter graph
is closest to being 2K-random, while the WHOIS graph is the
furthest. This finding has a direct impact on topology
generators: it implies that the skitter topology can be
successfully recreated by capturing the JDD observed in the
measured topology. We surmise that a 2K-random generator will
closely approximate the skitter graph. Similarly, a 2K-random
generator reproducing the JDD observed in the measured BGP
graph will be able to create an approximate model of the
BGP graph.

Figure 14 also shows the constant values of local clustering
predicted by the corresponding 1K- and 0K-random
graph models, $C_{1K}$ (solid line) and $C_{0K}$ (dash-dotted
line). Naturally, the 1K-random graphs, with a constant form of
local clustering, describe the observed clustering less
accurately than the 2K-random model, except in the WHOIS case,
which is closest to being 1K-random. Clustering in the
0K-random graphs is even further away, being orders of magnitude
smaller than the clustering observed in all three graphs.
Note that the ratio $C_{0K}/C_{1K}$ is an indirect indicator of
a graph's proximity to being 0K-random. The $C_{0K}/C_{1K}$
values for our graphs ($1 \cdot 10^{-2}$ for WHOIS,
$6 \cdot 10^{-4}$ for skitter, and $3 \cdot 10^{-4}$ for BGP)
indicate that the WHOIS graph is better approximated by the
0K-random model than the other two graphs. The BGP graph is the
least 0K-random in that respect.

In summary, the 2K-random graph model approximates
the skitter topology best, while the PLRG generator is
inferior for all three graphs.

5 Limitations 

Our work suffers from a number of methodological limita- 
tions and biases. We discuss each in turn below along with 
the potential consequences. 

We have tried to be exhaustive while compiling our list 
of graph metrics considered by the community. However, 
it is possible that we may have missed important metrics or 
that additional important metrics may be proposed that are 
not well captured by, for instance, joint degree distribution. 

Another limitation is our available data. Although the
data sets we examine represent the current state of the art in
macroscopic AS topology, they are incomplete and indirect
reflections of the underlying topology. They require
processing before producing the desired AS graph. For all our
data sets, researchers must make choices while dealing with
ambiguities and errors in the raw data. One such example
is the detection of "false" links created by route changes in
traceroute data. This paper does not address how different
choices in processing the original data may result in
different values for our target metrics. Instead we have
attempted, where possible, to use best practices to extract
topologies as presented in papers and in our discussions with
other researchers.

Figure 14: Local clustering vs. graph randomness. Squares show
local clustering observed in the real topology. The dashed line
is its mean value. Crosses show local clustering predicted by
the 2K-random graph model $C_{2K}(k)$. The dotted line is its
mean value. The solid and dash-dotted lines are constant
clusterings predicted by the 1K- and 0K-random graph models.



Next, we limit our data collection to a single month for 
obtaining skitter and BGP data. While we believe that our 
results will hold true for historical data and are not an arti- 
fact of the current Internet or our sampling period, we leave 
this study to future work. 

Finally, we come to the role played by JDD in topological 
studies. JDD has successfully explained the resulting met- 
ric values as well as inherent differences in skitter, BGP and 
WHOIS graphs. As a next step, we compare clustering in 
our observed topologies to the predicted clustering values in 
the 2K-random graph. The proximity between the observed 
and predicted models gives us further reason to believe that 
graphs generated by the 2K-random model come close to 
the original topology. Ideally, we could use a graph gen- 
erator that uses the measured JDD of a graph to produce 
random graphs with similar JDDs, which in turn would also 
display similar values for a variety of important graph met- 
rics. We leave such a potential demonstration of the value 
of JDD for capturing a broad range of graph characteristics 
to future work. 



6 Analysis and Conclusions 

We discussed the properties of Internet AS-level topolo- 
gies extracted from the three most popular sources of AS 
topology data: skitter measurements, BGP tables, and the 
RIPE WHOIS database. We compared the derived topolo- 
gies based on a set of important and frequently used statis- 
tical characteristics. 

We further presented a detailed comparison of widely
available sources of topology data in terms of a number of
popular metrics studied in the literature. Of the set of
metrics we considered, the joint degree distribution
$P(k_1, k_2)$ embeds the most information about a graph, since
this distribution determines both the average node degree
$\bar{k}$ and the assortativity coefficient $r$. We find that,
for the data sources we consider, a 2K-random model reproducing
the JDD of the original topology also captures other crucial
topological characteristics. While additional work is required
to verify this claim, we believe that JDD may be a powerful
metric for capturing a variety of important graph properties.
Isolating such a metric or small set of metrics is a
prerequisite to developing accurate topology generators
to assist a broad array of research and development efforts.
Developing such a JDD-based topology generator and further
demonstrating this concept is the subject of our current
research.

We also propose criteria to evaluate how well the random
graph models reproducing the average node degree $\bar{k}$
(0K-random), the degree distribution $P(k)$ (1K-random), or the
JDD $P(k_1, k_2)$ (2K-random) approximate characteristics of
the observed topologies. Using clustering as a measure of
accuracy of the 2K-random approximation, we find that the
2K-random model describes the skitter graph most accurately.
Using the assortativity coefficient (calculated from
the JDD) as a measure of accuracy of the 1K-random
approximation, we find that 1K- or 0K-random graph descriptions
best fit the WHOIS graph, but are less successful in
the skitter and BGP cases. The latter fact implies that the
power-law random graph (PLRG) model (which is a special
case of 1K-random models) and topology generators based
on it fail to accurately capture the important properties of
the skitter or BGP graphs. Similarly, the PLRG model fails
to recreate the WHOIS graph since its node degree distribution
does not follow a power law at all.

Finally, one may ask which data source is closest to re- 
ality. We emphasize that there is not one but at least three 
data sources of the Internet AS-level topology: skitter, BGP, 
and WHOIS data, and that the resulting graphs present dif- 
ferent views of the Internet. The skitter graph closely re- 
flects the topology of actual Internet traffic flows, i.e. the 
data plane. The BGP graph reveals the topology seen by 
the routing system, i.e. the control plane. Naturally, these 
two topologies are somewhat different. Understanding their 
incongruities is a subject of ongoing research [51, 18, 52].
The WHOIS graph represents a record of the Internet topol- 
ogy created by human actions, i.e. the management plane. 
It is not surprising that this human-generated view of the 
Internet has different topological properties than the other 
two graphs. The observed abundance of tangential links 
between ASes is likely to reflect unintentional or even in- 
tentional over-reporting by some providers of their peering 
arrangements. 

Our analysis should arm researchers with better insights 
into specifics of each topology. We hope that our study 
encourages the validation of existing models against real 
data and also motivates the development of better topology
models.